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Abstract 

We describe a method to perform functional operations on probability distri¬ 
butions of random variables. The method uses reproducing kernel Hilbert space 
representations of probability distributions, and it is applicable to all operations 
which can be applied to points drawn from the respective distributions. We refer 
to our approach as kernel probabilistic programming. We illustrate it on synthetic 
data, and show how it can be used for nonparametric structural equation models, 
with an application to causal inference. 


1 Introduction 

Data types, derived structures, and associated operations play a crucial role for pro¬ 
gramming languages and the computations we carry out using them. Choosing a data 
type, such as Integer, Float, or String, determines the possible values, as well as the 
operations that an object permits. Operations typically return their results in the form 
of data types. Composite or derived data types may be constructed from simpler ones, 
along with specialised operations applicable to them. 

The goal of the present paper is to propose a way to represent distributions over data 
types, and to generalize operations built originally for the data types to operations ap¬ 
plicable to those distributions. Our approach is nonparametric and thus not concerned 
with what distribution models make sense on which statistical data type (e.g., binary, 
ordinal, categorical). It is also general, in the sense that in principle, it applies to all 
data types and functional operations. The price to pay for this generality is that 

• our approach will, in most cases, provide approximate results only; however, we 
include a statistical analysis of the convergence properties of our approximations, 
and 
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• for each data type involved (as either input or output), we require a positive 
definite kernel capturing a notion of similarity between two objects of that type. 
Some of our results require, moreover, that the kernels be characteristic in the 
sense that they lead to injective mappings into associated Hilbert spaces. 

In a nutshell, our approach represents distributions over objects as elements of a Hilbert 
space generated by a kernel, and describes how those elements are updated by opera¬ 
tions available for sample points. If the kernel is trivial in the sense that each distinct 
object is only similar to itself, the method reduces to a Monte Carlo approach where the 
operations are applied to sample points which are propagated to the next step. Com¬ 
putationally, we represent the Hilbert space elements as finite weighted sets of objects, 
and all operations reduce to finite expansions in terms of kernel functions between 
objects. 

The remainder of the present article is organized as follows. After describing the 
necessary preliminaries, we provide an exposition of our approach ("Section [3j. Sec- 
tion[4]analyses an application to the problem of cause-effect inference using structural 
equation models. We conclude with a limited set of experimental results. 


2 Kernel Maps 

2.1 Positive definite kernels 

The concept of representing probability distributions in a reproducing kernel Hilbert 
space (RKHS) has recently attracted attention in statistical inference and machine 
learning Il2l l36l . One of the advantages of this approach is that it allows us to apply 
RKHS methods to probability distributions, often with strong theoretical guarantees 
mm. It has been applied successfully in many domains such as graphical models 
|3ji 39j|. two-sample testing HE domain adaptation mmm, and supervised learn¬ 
ing on probability distributions 1241148 1. We begin by briefly reviewing these methods, 
starting with some prerequisites. 

We assume that our input data {xi,..., x m } live in a nonempty set 3£ and are gen¬ 
erated i.i.d. by a random experiment with Borel probability distribution p. By k, we 
denote a positive definite kernel on x , i.e., a symmetric function 

k : Sfi x 3C -)• K. (1) 

(x,xf) H > k(x,x > ) (2) 

satisfying the following nonnegativity condition: for any m £ N, and a t ..... a m £ R, 

m 

aiajk(xi,Xj) > 0. (3) 

U= i 

If equality in Q implies that a\ = ■ ■ ■ = a m = 0, the kernel is called strictly positive 
definite. 
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2.2 Kernel maps for points 

Kernel methods in machine learning, such as Support Vector Machines or Kernel PCA, 
are based on mapping the data into a reproducing kernel Hilbert space (RKHS) SC 

0 SUEDES ED, 


4>: SC ->• JC (4) 

(5) 

where the feature map (or kernel map ) <t> satisfies 

k(x,x r ) = ( < t , (x),<f , (x , )) (6) 

for all x,x! £ SC. One can show that every k taking the form (|6) is positive definite, 
and every positive definite k allows the construction of SC and <t> satisfying |6|. The 
canonical feature map, which is what by default we think of whenever we write <t>, is 

<f>: SC -H- (7) 

xi —tk(x,.), (8) 

with an inner product satisfying the reproducing kernel property 

k(x,x') = (k{x,.),k(x',.)). (9) 

Mapping observations x £ SC into a Hilbert space is rather convenient in some cases. 
If the original domain SC has no linear structure to begin with (e.g., if the x are strings 
or graphs), then the Hilbert space representation provides us with the possibility to 
construct geometric algorithms by using the inner product of JC. Moreover, even if 
SC is a linear space in its own right, it can be helpful to use a nonlinear feature map 
in order to construct algorithms that are linear in JC while corresponding to nonlinear 
methods in SC. 


2.3 Kernel maps for sets and distributions 

One can generalize the map <t> to accept as inputs not only single points, but also sets 
of points or distributions. It was pointed out that the kernel map of a set of points 
X = {xi,..., x jn }, 

i m 

At|X]:=-E*(x«), (10) 

m ti 

corresponds to a kernel density estimator in the input domain ll33l l32ll . provided the 
kernel is nonnegative and integrates to 1. However, the map ( fT0| can be applied for 
all positive definite kernels, including ones that take negative values or that are not 
normalized. Moreover, the fact that /i [X] lives in an RKHS and the use of the associ¬ 
ated inner product and norm will have a number of subsequent advantages. For these 
reasons, it would be misleading to think of ( [TO} simply as a kernel density estimator. 

The kernel map of a distribution p can be defined as the expectation of the feature 
map 12 EH HD, 

M[p] := E *~/>[3>(x)]i (11) 
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II 

> 

5" 

mean of X 

+ 

ii 

5" 

moments of X up to order n € N 

k(x,x J ) strictly p.d. 

all of X (i.e., p injective) 


Table 1: What information about a sample X does the kernel map p [X] (see ( p~Q] >) 
contain? 


II 

> 

5" 

expectation of p 

k(x,x J ) = (( x,x') + 1)" 

moments of p up to order n £ N 

k^x^x 1 ) characteristic/universal 

all of p (i.e., p injective) 


Table 2: What information about p does the kernel map p\p] (see (jTT|)) contain? For 
the notions of characteristic/universal kernels, see mmm; an example thereof is the 
Gaussian kernel 126}. 


where we overload the symbol p and assume, here and below, that p is a Borel proba¬ 
bility measure, and 

[£(*>*')] < °°- ( 12 ) 

A sufficient condition for this to hold is the assumption that there exists anMgK such 
that 

||k(x,.)||<M<oo (13) 

or equivalently k(x ,x) < M 2 , on the support of p. Kernel maps for sets of points or 
distributions are sometimes referred to as kernel mean maps to distinguish them from 
the original kernel map. Note, however, that they include the kernel map of a point as a 
special case, so there is some justification in using the same term. If p = px is the law 
of a random variable X, we sometimes write p [X] instead of p [p\. 

In all cases it is important to understand what information we retain, and what we 
lose, when representing an object by its kernel map. We summarize the known results 

147] [U [36] 011 ESI in Tables [T] an d@ 

We conclude this section with a discussion of how to use kernel mean maps. To this 
end, first assume that <f> is injective, which is the case if k is strictly positive definite (see 
Table [1} or characteristic/universal (see Table [2]. Particular cases include the moment 
generating function of a RV with distribution p. 


M p {.)= E 


•x~p 


<*.•> 


(14) 


which equals © for k(x.x') = e i ' x '^ } using <0. 

We can use the map to test for equality of data sets. 


or distributions. 


H/xfX] — /j [X']|| = 0^=>X = X', 

\\n\p]-n\p']\\ = o <==>/? = //. 


(15) 

(16) 
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Two applications of this idea lead to tests for homogeneity and independence. In the 
latter case, we estimate ||ju[/?.tPy] — /J- 11 HIED; in the former case, we estimate 

\\ll\p\-V\p']\\ El- 

Estimators for these applications can be constructed in terms of the empirical mean 
estimator (the kernel mean of the empirical distribution) 

i m 

M [An] = =/*[*]. d 7 ) 

m tl 

where X = {jq,..., x m } is an i.i.d. sample from p (cf. ©)- As an aside, note that 
using ideas from James-Stein estimation El, we can construct shrinkage estimators 
that improve upon the standard empirical estimator (see e.g., EH3§1). 

One can show that u \p m \ converges at rate in k 2 (cf. l36l and [37] Theorem 27]): 

Theorem 1 Assume that ||/||oo < 1 for all f £ with \\f\\j)? < 1. Then with proba¬ 
bility at least 1—5, 


\\n\pm\-H\p]\\jtr < -E 
m 


Vt7k 


21n(2/5) 


(18) 


where Kjj := k(x,,Xj). 

Independent of the requirement of injectivity, p can be used to compute expecta¬ 
tions of arbitrary functions / living in the RKHS, using the identity 


E .x~ P [f(x)] = {n\p],f), 

which follows from the fact that k represents point evaluation in the RKHS, 


(19) 


fix) = (k(x,.),f). 


( 20 ) 


A small RKHS, such as the one spanned by the linear kernel 

k{x,x') = {x,xf), 


( 21 ) 


may not contain the functions we are interested in. If, on the other hand, our RKHS 
is sufficiently rich (e.g., associated with a universal kernel |46]| ). we can use © to 
approximate, for instance, the probability of any interval (a, b) on a bounded domain, 
by approximating the indicator function I(a,b) as a kernel expansion Yf=\ Uik{xj ..), and 
substituting the latter into CD- See GD for further discussion. Alternatively, if p has a 
density, we can estimate it using methods such as reproducing kernel moment matching 
and combinations with kernel density estimation ED HD - 

This shows that the map is not a one-way road: we can map our objects of interest 
into the RKHS, perform linear algebra and geometry on them 1341 . and at the end 
answer questions of interest. In the next section, we shall take this a step further, and 
discuss how to implement rather general operations in the RKHS. 

Before doing so, we mention two additional applications of kernel maps. We can 
map conditional distributions and perform Bayesian updates SHEDl), and we can 
connect kernel maps to Fourier optics, leading to a physical realization as Fraunhofer 
diffraction lfl4l . 
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3 Computing Functions of Independent Random Vari¬ 
ables 

3.1 Introduction and Earlier Work 

A random variable (RV) is a measurable function mapping possible outcomes of an un¬ 
derlying random experiment to a set E (often, E C but our approach will be more 
general). The probability measure of the random experiment induces the distribution of 
the random variable. We will below not deal with the underlying probability space ex¬ 
plicitly, and instead directly start from random variables X,Y with distributions px-Pv 
and values in 5X ,$/. Suppose we have access to (data from) px and py, and we want 
to compute the distribution of the random variable f(X,Y), where / is a measurable 
function defined on St~ x 'X/. 

For instance, if our operation is addition f(X,Y) = X + Y, and the distributions 
px and py have densities, we can compute the density of the distribution of f(X,Y) 
by convolving those densities. If the distributions of X and Y belong to some para¬ 
metric class, such as a class of distributions with Gaussian density functions, and if 
the arithmetic expression is elementary, then closed form solutions for certain favor¬ 
able combinations exist. At the other end of the spectrum, we can resort to numerical 
integration or sampling to approximate f(X,Y). 

Arithmetic operations on random variables are abundant in science and engineer¬ 
ing. Measurements in real-world systems are subject to uncertainty, and thus subse¬ 
quent arithmetic operations on these measurements are operations on random variables. 
An example due to l42l is signal amplification. Consider a set of n amplifiers connected 
together in a serial fashion. If the amplification of the ;-th amplifier is denoted by A), 
then the total amplification, denoted by Y, is Y = X\ ■ Xi - ■ -X,,, i.e., a product of n 
random variables. 

A well-established framework for arithmetic operation on independent random 
variables (iRVs) relies on integral transform methods ll42ll . The above example of 
addition already suggests that Fourier transforms may help, and indeed, people have 
used transforms such as the ones due to Fourier and Mellin to derive the distribution 
function of either the sum, difference, product, or quotient of iRVs J3] |43] 130] 1421 . ||49l 
proposes an approximation using Laguerre polynomials, and a notion of envelopes 
bounding the cumulative distribution function. This framework also allows for the 
treatment of dependent random variables, but the bounds can become very loose after 
repeated operations. ETl approximate the probability distributions of the input random 
variables as mixture models (using uniform and Gaussian distributions), and apply the 
computations to all mixture components. 

|(T8| considers a numerical approach to implement arithmetic operations on iRVs, 
representing the distributions using piecewise Chebyshev approximations. This lends 
itself well to the use of approximation methods that perform well as long as the func¬ 
tions are well-behaved. Finally, Monte Carlo approaches can be used as well, and are 
popular in scientific applications (see e.g., 0). 

The goal of the present paper is to develop a derived data type representing a distri¬ 
bution over another data type, and to generalize the available computational operations 
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to this data type, at least approximately. This would allow us to conveniently handle 
error propagation as in the example discussed earlier. It would also help us perform in¬ 
ference involving conditional distributions of such variables given observed data. The 
latter is the main topic of a subject area that has recently begun to attract attention, 
probabilistic programming m. A variety of probabilistic programming languages 
has been proposed [50J27]|4|. To emphasize the central role that kernel maps play in 
our approach, we refer to it as kernel probabilistic programming (KPP). 


3.2 Computing Functions of Independent Random Variables using 
Kernel Maps 

The key idea of KPP is to provide a consistent estimator of the kernel map of an ex¬ 
pression involving operations on random variables. This is done by applying the ex¬ 
pression to the sample points, and showing that the resulting kernel expansion has the 
desired property. Operations involving more than one RV will increase the size of the 
expansion, but we can resort to existing RKHS approximation methods to keep the 
complexity of the resulting expansion limited, which is advisable in particular if we 
want to use it as a basis for further operations. The benefits of KPP are three-fold. 
First, we do not make parametric assumptions on the distributions associated with the 
random variables. Second, our approach applies not only to real-valued random vari¬ 
ables, but also to multivariate random variables, structured data, functional data, and 
other domains, as long as positive definite kernels can be defined on the data. Finally, 
it does not require explicit density estimation as an intermediate step, which is difficult 
in high dimensions. 

We begin by describing the basic idea. Let / be a function of two independent 
RVs X,Y taking values in the sets :X, 3k, and suppose we are given i.i.d. ///-samples 
jf 1 ,... ,x m and y |,... ,y m . We are interested in the distribution of f(X. Y), and seek to 
estimate its representation p[f(X,Y)] := E[<t>(/(A,T))] in the RKHS as 

j m 

-2 I *(/(*/,?,))• (22) 

u= i 

Although xi,... ,x m ~ px and yi.... ,y„, ~ py are i.i.d. observations, this does not im¬ 
ply that the {f{xi,yj)\i,j = 1,... ,m} form an i.i.d. m 2 -sample from f(X,Y), since — 
loosely speaking — each x, (and each yf) leaves a footprint in /// of the observations, 
leading to a (possibly weak) dependency. Therefore, Theorem [T] does not imply that 
|22) is consistent. We need to do some additional work: 

Theorem 2 Given two independent random variables X. Y with values in 'X'. 'W, mu¬ 
tually independent i.i.d. samples x \,... ,x m and yi,...,y n , a measurable function f : 
Sfi x fX —> X, and a positive definite kernel on x with RKHS map <t>, then 

i m n 

-IE,*</<«,» (23) 

1=17=1 

is an unbiased and consistent estimator of fi[f(X,Y)\. 
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Moreover, we have convergence in probability 

i m n 

—Zi®(Axi, yj ))-nw(,x,Y))} 

11 i=lj= 1 

As an aside, note that ( |23| > is an RKHS valued two-sample U-statistic. 

Proof For any i,j, we have E[<t>(/(x;,y 7 -))] = E[<t>(/(A,7))], hence ( f23j i is unbiased. 

The convergence ( [24] ) can be obtained as a corollary to Theorem[3] and the proof is 
omitted here. 

3.3 Approximating Expansions 

To keep computational cost limited, we need to use approximations when performing 
multi-step operations. If for instance, the outcome of the first step takes the form 
( |23[ , then we already have m ■ n terms, and subsequent steps would further increase the 
number of terms, thus quickly becoming computationally prohibitive. 

We can do so by using the methods described in Chapter 18 of {34]. They fall in two 
categories. In reduced set selection methods, we provide a set of expansion points (e.g., 
all points f(x,i,yf) in (|23j)), and the approximation method sparsifies the vector of ex¬ 
pansion coefficients. This can be for instance done by solving eigenvalue problems or 
linear programs. Reduced set construction methods, on the other hand, construct new 
expansion points. In the simplest case, they proceed by sequentially finding approxi¬ 
mate pre-images of RKHS elements. They tend to be computationally more demanding 
and suffer from local minima; however, they can lead to sparser expansions. 

Either way, we will end up with an approximation 

t W&fo) (25) 

k=\ 

of ( |23l >, where usually p <C m ■ n. Here, the Zk are either a subset of the /’(x,■,}’/), or 
other points from 

It is instructive to consider some special cases. For simplicity, assume that = W 1 . 
If we use a Gaussian kernel 

k(x,x') = exp(—||x —x'|| 2 /(2 a 1 )) (26) 

whose bandwidth cr is much smaller than the closest pair of sample points, then the 
points mapped into the RKHS will be almost orthogonal and there is no way to sparsify 
a kernel expansion such as ( |23[ > without incurring a large RKHS error. In this case, we 
can identify the estimator with the sample itself, and KPP reduces to a Monte Carlo 
method. If, on the other hand, we use a linear kernel k(z,z') = (z,z!) on = R. d , 
then <t> is the identity map and the expansion ( |23[ > collapses to one real number, i.e., 
we would effectively represent f(X,Y) by its mean for any further processing. By 
choosing kernels that lie ‘in between’ these two extremes, we retain a varying amount 
of information which we can thus tune to our wishes, see Table|I] 


= 0 D 


+ —j= I , (m,n— ►«>). (24) 
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3.4 Computing Functions of RKHS Approximations 

More generally, consider approximations of kernel means u \X ] and ll\Y\ 


m' n! 

m ■■= E cc&M), fi[Y] := E ppytyj). (27) 

i= 1 j= 1 

In our case, we think of ([27]) as RKHS-norm approximations of the outcome of pre¬ 
vious operations performed on random variables. Such approximations typically have 
coefficients q£ 1" and /3 £ W" that are not uniform, that may not sum to one, and 
that may take negative values ll34l . e.g., for conditional mean maps |40l [91. 

We propose to approximate the kernel mean jj.[f(X,Y)] by the estimator 


mxj)] 


y,n' „.yn' o E E OiPMViJj)), 
Li= 1 Lj= 1 Pj i= 1 7=1 


(28) 


where the feature map <t>- defined on 3f, the range of /, may be different from both 
<f> A and <t>y. The expansion has m! ■ n' terms, which we can subsequently approximate 
more compactly in the form ( |25] >, ready for the next operation. Note that ( [28] ) contains 
( [23] ) as a special case. 

One of the advantages of our approach is that ( [23] ) and ( [28] ) apply for general data 
types. In other words, ,7’. 'W, 3T need not be vector spaces — they may be arbitrary 
nonempty sets, as long as positive definite kernels can be defined on them. 


Convergence analysis in an idealized setting We analyze the convergence of ( [28] ) 
under the assumption that the expansion points are actually samples xi,... ,x m from 
X and y i,... ,y n from Y, which is for instance the case if the expansions ( [27] ) are the 
result of reduced set selection methods (cf. Section |33] l. Moreover, we assume that the 
expansion coefficients cq,..., a,„ and j3i,..., j3„ are constants, i.e., independent of the 
samples. 

The following proposition gives a sufficient condition for the approximations in 
( [27] ) to converge. Note that below, the coefficients a\,...,a m depend on the sample 
size m, but for simplicity we refrain from writing them as a\ m , 0 £ mm ; likewise, for 
jSi,..., /3„. We make this remark to ensure that readers are not puzzled by the below 
statement that Y 4 L 1 —► 0 as m —)■ 


Proposition 1 Letx 1 ,... ,x m be an i.i.d. sample and (0 !,-)” l =l be constants with Y!JL\ a i = 
1. Assume E[k(X,X)] > E[k(X,X)], where X andX are independent copies ofxj. Then, 


the convergence 


E 


E a^Cxv) - M [ X} 

i= 1 


2 

->0 


(in —> 00 ) 


holds true if and only ifYH'=\ 70 as m —y 


Proof From the expansion 


E 


m 

E«,^> [xi)-n[X] 
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m m 

= I os,-a,sE[fc(xi,Xj)] — 21a,E[A:(x,',X)] + E[k(X,Z)] 

i,s= 1 i=l 

= ( 1 -£« i ) 2 E[A:(X,Z)] + (£a, 2 ){ E [ fc(X)Z )]_ E [ fc ( X)1 )]} ) 

i i 

the assertion is straightforward. 

The next result shows that if our approximations ( |27} converge in the sense of 
Proposition [I] then the estimator ([28} (with expansion coefficients summing to 1) is 
consistent. 


Theorem 3 Let xi,... ,x m and yi, • - - ,y„ be mutually independent i.i.d. samples, and 
the constants (j3/)j =1 satisfy Y!?=\ a i = Y!]= \ fij = 1- Assume Y,T =1 a f am ^ 

Y!)=\ Pj converge to zero as n,m —t Then 


EE oCiPMKwjV-iilrtlCJ)] 


<=t;=i 



as m,n —> «>. 


Proof By expanding and taking expectations, one can see that 


E 


11 aiPMfix^jV-EWfiXJ))} 

«=i ;=i 


equals 


£ I a ; 2 j3 ; 2 E[k(/(X,T),/(X,F))] + IIa ( a J i3 7 2 E[k(/(X,T),/(l,T))] 

<=i;=t sjti j 

+ II« ( 2 ME [k(f(X,Y),f(X,Y))] 

i tjij 

+11 a i a s PjP t E[k(f(X,Y),f{X,Y))] 

-2lIa 1 /3 J E[k(/(Z,7),/(X,F))]+E[k(/(Z,F),/(Z,?))] 

i j 

= (E«< 2 ) (I/3J)e [k{f{X,Y),f(X,Y))] 
i j 

l l j l j 

"E« ; 2 (E^) 2 -(Ia,) 2 I/3 2 lE[k(/(Z,F),/(Z,F))] 

i j i j J 

+ ((E«o 2 -E« 2 ) (LPj)nk(f(x,nf(x,m 
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+ (E« 2 ) ((LPj) 2 -LPj) E l k (f( x ’ Y )’fM)] 

V i j j 

= (E« 2 ) (1/3 l)E[k(f(X,Y),f(X,Y))] 

V i V j 

+ {E « 2 E # - E « 2 - E z 3 ; 2 } E [*(/(*. r),/(*> ?))] 

+ (3 -E« 2 ) (E^ ? ) E [W> y )>/(^> y ))] 

* j 

+ (E« 2 ) (i-E^ ? )w(^y)./(^.y))]. 

> j 

which implies the norm in the assertion of the theorem converges to zero at 
Op ^ \J Li a f + yE,/3j j under the assumptions on a; and /3y. Here (X,Y) is an inde¬ 
pendent copy of (X, K). This concludes the proof. 

Note that in the simplest case, where a,- = 1 /m and (ij = 1/n, we have af = 1 jm 
and ]_j j3 j = 1 /n, which proves Theorempl It is also easy to see from the proof that we 
do not strictly need ]_/ a, = E/A/ = 1 — for the estimator to be consistent, it suffices if 
the sums converge to 1. For a sufficient condition for this convergence, see CD- 

More general expansion sets To conclude our discussion of the estimator ( |28] i, we 
turn to the case where the expansions ([27]) are computed by reduced set construction, 
i.e., they are not necessarily expressed in terms of samples from X and Y. This is more 
difficult, and we do not provide a formal result, but just a qualitative discussion. 

To this end, suppose the approximations ( |27| ) satisfy 

m' 

a, = 1 and for all /, a, > o, (29) 

i=i 

r! 

[\j = 1 and for all /. /j ; > 0, (30) 

7=1 

and we approximate ji[f(X,Y)] by the quantity ( |2 8 [ >. 

We assume that ( [27] ) are good approximations of the kernel means of two unknown 
random variables X and Y ; we also assume that / and the kernel mean map along with 
its inverse are continuous. We have no samples from X and Y, but we can turn \21\ into 
sample estimates based on artificial samples X and Y, for which we can then appeal to 
our estimator from Theorem [2] 

To this end, denote by X' = (x\...., x! m ,) and Y' = (y'|,... ,y',) the expansion points 
in ( |27| . We construct a sample X = (xi ,X 2 , ■ ■ ■) whose kernel mean is close to Y!u=i < f > v (x' l ) 
as follows: for each i, the point x J i appears in X with multiplicity [m ■ a,j, i.e., the largest 
integer not exceeding m ■ (Xj. This leads to a sample of size at most m. Note, moreover, 
that the multiplicity of x-, divided by m, differs from a, by at most 1 jm, so effectively 
we have quantized the a, coefficients to this accuracy. 
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Since m! is constant, this implies that for any e > 0, we can choose m large enough 
to ensure that 


1 

m 


E •M**) - E 

1=1 i=i 


2 


< e. 


(31) 


We may thus work with j Y!i=i *&x(xi )> which for strictly positive definite kernels 
corresponds uniquely to the sample X = (xi,... ,x m ). By the same argument, we ob¬ 
tain a sample Y = (yi,.... y„) approximating the second expansion. Substituting both 
samples into the estimator from Theorem[2]leads to 


mxj)} 


r^EEWI-W)), 

= 1 «iLj=iP,7 1=12=1 


(32) 


where a, = [m ■ a,J/w, and = \_n ■ fij\/n. By choosing sufficiently large m,n, this 
becomes an arbitrarily good approximation (in the RKHS norm) of the proposed es¬ 
timator ( |28| >. Note, however, that we cannot claim based on this argument that this 
estimator is consistent, not the least since Theorem[2]in the stated form requires i.i.d. 
samples. 


Larger sets of random variables Without analysis, we include the estimator for the 
case of more than two variables: Let g be a measurable function of jointly independent 
RVs Uj (j = 1 Given i.i.d. observations u \,... ,m/„ ~ Uj, we have 

1 m . , 

- E u-..,U p )]. (33) 

m\,...,m p = 1 

in probability. Here, in order to keep notation simple, we have assumed that the samples 
sizes for each RV are identical. 

As above, we note that (i) g need not be real-valued, it can take values in some set 
for which we have a (possibly characteristic) positive definite kernel; (ii) we can 
extend this to general kernel expansions like ( [28] ); and (iii) if we use Gaussian kernels 
with width tending to 0, we can think of the above as a sampling method. 


4 Dependent RVs and Structural Equation Models 

For dependent RVs, the proposed estimators are not applicable. One way to handle 
dependent RVs is to appeal to the fact that any joint distribution of random variables 
can be written as a structural equation model with independent noises. This leads to an 
interesting application of our method to the field of causal inference. 

We consider a model Xj = /,(PA,,(/;), for i = 1 with jointly independent 

noise terms U\,... ,U p . Such models arise for instance in causal inference 1 1281 . Each 
random variable A, is computed as a function /j- of its noise term (/, and its parents PA, 
in an underlying directed acyclic graph (DAG). Every graphical model w.r.t. a DAG 
can be expressed as such a structural equation model with suitable functions and noise 
terms (e.g., ED). 
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If we recursively substitute the parent equations, we can express each X\ as a func¬ 
tion of only the independent noise terms U 1 ,..., U p , 

X i =g i (U 1 ,...,U p ). (34) 

Since we know how to compute functions of independent RVs, we can try to test such 
a model (assuming knowledge of all involved quantities) by estimating the distance 
between RKHS images, 

A=\\p[X i ]-p[g i (U 1 ,...,U p )]\\ 2 (35) 

using the estimator described in ( [33] ) (we discuss the bivariate case in Theorem |4j. 
It may be unrealistic to assume we have access to all quantities. However, there is a 
special case where this is conveivable, which we will presently discuss. This is the case 
of additive noise models |29l 


Y = f(X) + U, with XJLU. (36) 

Such models are of interest for cause-effect inference since it is known l29l that in the 
generic case, a model ( [36] ) can only be fit in one direction, i.e., if ( [36] ) holds true, then 
we cannot simultaneously have 


X = g(Y) + V , with Y XV. 

To measure how well ([36]) fits the data, we define an estimator 


A emp ' — 


i m i m 

- £ E *(/Of)+ u i ) 


i,i =i 


Analogously, we define the estimator in the backward direction 


A bw , = 
^emp * 


i m i m 

- L^i)- 2 L ®(g(yr)+Vj) 


i= 1 


i,j= 1 


(37) 


(38) 


(39) 


Here, we assume that we are given the conditional mean functions /: x E[F | X = x] 
and g : y i-)- E[X | Y = y]. 

In practice, we would apply the following procedure: we are given a sample (xi,yi), 
..., (x m ,y m ). We estimate the function / as well as the residual noise terms u\,...,u m 
by regression, and likewise for the backward function g [29]. Strictly speaking, we 
need to use separate subsamples to estimate function and noise terms, respectively, see 


Below, we show that A emp converges to 0 for additive noise models ( |3 6 [ > . For the 
incorrect model ( |37| ), however, Aj* will in the generic case not converge to zero. We 
can thus use the comparison of both values for deciding causal direction. 

Theorem 4 Suppose xi ,... ,x m and u \,..., u m are mutually independent i.i.d. samples, 
and }’i = f(xj) + Uj. Assume further that the direction of the additive noise model is 
identifiable m and the kernel for x is characteristic. We then have 

Aemp —> 0 and (40) 
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in probability as m —> °°. 


A emp -fr 0 


(41) 


Proof Equation ( |40| follows from Theorem I2I since |^ Jl'" j <t>(y,) — u\Y]\\ —)• 0 and 
1 2 

I Y.7j=\ < ^ > (f( x i) + Uj ) — /i [7] | —0 (all convergences in this proof are in probability). 
To prove ( |4l) . we assume that —>• 0 which implies 


Z2 E *(8(yt)+vj)-m 


i,j= 1 


(42) 


The key idea is to introduce a random variable V that has the same distribution as V but 
is independent of Y and to consider the following decomposition of the sum appearing 
in ( |42| >: 


1 m 

-2 E ^(s(yi)+vj) 

i,j =1 


, m 

ll^feW+Vi) 

772 z “j 


1 m m— 1 

+ — E E + v i+k) 

m rr ,, 

i=l k= 1 


1 1 

m m 


E°( x ') 

i= 1 






—: A m + Z? 


mi 


where the index for v is interpreted modulo m, for instance, v m+ 3 := V3. Since v,- + * = 
— g(y;+jr) is independent of y,, it further follows from Theorem [2] that || A m — 
in[X]\\ -A 0 and ||B m - ^H[g{Y)+ V ]|| -> 0. Therefore, 


A m + B m - ' p [X] - ——-/r[g(F) + V] 
m m 


0 . 


Together with (|42]) this implies 


r , 1 . , in — 1 . , . 

M[X]--Ai[X] -/x[g(r) + V] 


and therefore 

p[g(Y)+V\=p[X]. 

Since the kernel is characteristic, this implies 

g(Y) + V = X (in distribution), 

with f IV, which contradicts the identifiability of the additive noise model. 

As an aside, note that Theorem [4] would not hold if in ( |39| ) we were to estimate 
P[g(Y) + V] by hlULi^igiy^ + Vi) instead of ^ I% j=l ®(g(yi) + vj). 
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5 Experiments 

5.1 Synthetic data 

We consider basic arithmetic expressions that involve multiplication X ■ Y , division 
X/Y, and exponentiation X Y on two independent scalar RVs X and Y. Letting px = 
c/L(3,0.5) and p Y = c/L(4,0.5), we draw i.i.d. samples X = {xi,... ,x m } and Y = 
{yi,---,y m } from p x and p Y . 

In the experiment, we are interested in the convergence (in RKHS norm) of our 
estimators to p[f(X,Y)]. Since we do not have access to the latter, we use an indepen¬ 
dent sample to construct a proxy £[/(X,7)] = (1 /j =l <£> z (f(xi,yj)). We found 
that i = 100 led to a sufficiently good approximation. 

Next, we compare the three estimators, referred to as Hi,H 2 an d /I3 below, for 
sample sizes m ranging from 10 to 50: 

1. The sample-based estimator ( |23| 

2. The estimator ( |28j ) based on approximations of the kernel means, taking the form 
p[X] := ZT=i^x(xi) and £[Y] := EjLrfj&ybj) of p[X] and p[Y], respec¬ 
tively. We used the simplest possible reduced set selection method: we randomly 
subsampled subsets of size m! ss 0.4 • m from X and Y, and optimized the coef¬ 
ficients { (Xi ,..., a m i } and {£1,..., £„,<} to best approximate the original kernel 
means (based on X and Y) in the RKHS norm ll34l Section 18.3]. 

3. Analogously to the case of one variable 0. we may also look at the estimator 
£3 [X, Y] := (1 /m) ZT=i < J > ;(/( x o3 ; f))’ which sums only over m mutually indepen¬ 
dent terms, i.e., a small fraction of all terms of ( | 23 ] > . 

For i = 1,2,3, we evaluate the estimates £, [/(X,T)] using the error measure 

L=||£ ( [/(X,T)]-£[/(X,T)]|| 2 . (43) 

We use ([6]) to evaluate L in terms of kernels. In all cases, we employ a Gaussian RBF 
kernel |26]l whose bandwidth parameter is chosen using the median heuristic, setting cr 
to the median of the pairwise distances of distinct data points D2. 

Figure |T| depicts the error ( |43[ ) as a function of sample size m. For all operations, 
the error decreases as sample size increases. Note that <4>, is different across the three 
operations, resulting in different scales of the average error in Figure|T] 

5.2 Causal discovery via functions of kernel means 

We also apply our KPP approach to bivariate causal inference problem (cf. Section|4|. 
That is, given a pair of real-valued random variables X and Y with joint distribution 
pxy, we are interested in identifying whether X causes Y (denote as X —> Y) or Y 
causes X (denote as Y —» X) based on observational data. We assume an additive noise 
model E = f(C) + U with C _1L U where C.E.U denote cause, effect, and residual (or 
“unexplained”) variable, respectively. Below we present a preliminary result on the 
CauseEf f ectPairs benchmark data set m. 
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multiplication division exponentiation 



Figure 1: Error of the proposed estimators for three arithmetic operations — multipli¬ 
cation X ■ Y, division X/Y, and exponentiation X Y — as a function of sample size m. 
The error reported is an average of 30 repetitions of the simulations. The expensive 
estimator fl\ (see ( |23| ) performs best. The approximation /E (see ( [28] )) works well as 
sample sizes increase. 


For each causal pair (X,Y) = {{x\,yi(x m ,y m )}, we estimate functions y sa 
f(x) and x ss g(y) as least-squares fits using degree 4 polynomials. We illustrate one 
example in Figure [2a] Next, we compute the residuals in both directions as w,- = y, — 
f(xi) and Vj = xj — g(vj)(^| Finally, we compute scores A x^-y and A y^x by 


A x^>Y 


A Y-xX 


i m i m 

- £ -y £ Hf{ x d + u i) 

m m & t 

i m i m 

~ £ ^(*«0 Xp L + v j) 

m ... 

r=l 1 , 7 = 1 


Following Theorem[4] we can use the comparison between Xx--^y and &y--,x to infer the 
causal direction. Specifically, we decide that X —>■ Y if A x^>y < A y^-x- and that Y —> X 
otherwise. In this experiment, we also use a Gaussian RBF kernel whose bandwidth 
parameter is chosen using the median heuristic. To speed up the computation of A^-,k 
and A y^x, we adopted a finite approximation of the feature map using 100 random 
Fourier features (see ED for details). We allow the method to abstain whenever the 
two values are closer than 8 > 0. By increasing 5, we can compute the method’s 
accuracy as a function of a decision rate (i.e., the fraction of decisions that our method 
is forced to make) ranging from 100% to 0%. 

Figure [2b] depicts the accuracy versus the decision rate for the 81 pairs in the 
CauseEf f ectPairs benchmark collection. The method achieves an accuracy of 80%, 
which is significantly better than random guessing, when forced to infer the causal 
direction of all 81 pairs. 


'For simplicity, this was done using the same data; but cf. our discussion following j38[. 
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(a) Pair 78 and regressors f,g 



(b)the accuracy curve 


Figure 2: ([a| Scatter plot of the data of causal pair 78 in the CauseEffectPairs 
benchmarks, along with the forward and backward function fits, y = f(x) and x = g{y). 
|b]i Accuracy of cause-effect decisions on all the 81 pairs in the CauseEf f ectPairs 
benchmarks. 


6 Conclusions 

We have developed a kernel-based approach to compute functional operations on ran¬ 
dom variables taking values in arbitrary domains. We have proposed estimators for 
RKHS representations of those quantities, evaluated the approach on synthetic data, 
and showed how it can be used for cause-effect inference. While the results are en¬ 
couraging, the material presented in this article only describes the main ideas, and 
much remains to be done. We believe there is significant potential for a unified per¬ 
spective on probabilistic programming based on the described methods, and hope that 
some of the open problems will be addressed in future work. 


Acknowledgements Thanks to Dominik Janzing, Le Song and Ilya Tolstikhin for dicussions 
and comments. 
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